The dataset used in this work is HTRU2, which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South).
Pulsars are a rare type of neutron star that produce radio emission detectable here on Earth. Pulsar astronomers have now detected over 1500 pulsars and expect to discover thousands more during the next few years.
As pulsars rotate, their emission beam sweeps across the sky, and when this crosses our line of sight, produces a detectable pattern of broadband radio emission. As pulsars rotate rapidly, this pattern repeats periodically.
Each pulse is found to be made up of radio waves of different frequencies. It is observed that the highest frequencies of a pulse arrive at a telescope slightly before the lower frequencies. The reason for this is that the pulse has been travelling through the interstellar medium (the space between the pulsar and the Earth) and the different frequencies making up the pulse travel at different speeds through this medium. This is referred to as the pulse dispersion and is due to the free electrons in the interstellar medium.
Machine learning tools are now being used to automatically label pulsar candidates to facilitate rapid analysis. Classification systems in particular are being widely adopted, which treat the candidate data sets as binary classification problems. Here the legitimate pulsar examples are a minority positive class, and spurious examples the majority negative class.
The dataset shared here contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples. Each candidate is described by 8 continuous variables. The data extracted for each sample consists of 4 simple statistics derived from the folded pulse profile, and 4 statistics from the DM-SNR curve. These are summarised below:
Mean of the integrated profile : $Prof_{\mu}=\frac{1}{n}\sum_{i=1}^n p_i$
Standard deviation of the integrated profile : $Prof_{\sigma}=\sqrt{\frac{\sum_{i=1}^n(p_i-{Prof}_{\mu})^2}{n-1}}$
Excess kurtosis of the integrated profile : ${Prof}_k = \frac{\frac{1}{n}\sum_{i=1}^n(p_i-{Prof}_{\mu})^4}{\left(\frac{1}{n}\sum_{i=1}^n(p_i-{Prof}_{\mu})^2\right)^2}-3$
Skewness of the integrated profile : ${Prof}_s = \frac{\frac{1}{n}\sum_{i=1}^n(p_i-{Prof}_{\mu})^3}{\left(\sqrt{\frac{1}{n-1}\sum_{i=1}^n(p_i-{Prof}_{\mu})^2}\right)^3}$
In particular, the kurtosis is a measure of the "tailedness" of the distribution: it quantifies how many samples lie in the tails. The kurtosis $k$ of a normal distribution is $3$, and the excess kurtosis of a distribution is defined as $k-3$.
The skewness of the pulse profile measures the symmetry of the profile about some mean or mode of the data.
Mean of the DM-SNR curve : $DM_{\mu}=\frac{1}{n}\sum_{i=1}^n d_i$
Standard deviation of the DM-SNR curve : $DM_{\sigma}=\sqrt{\frac{\sum_{i=1}^n(d_i-{DM}_{\mu})^2}{n-1}}$
Excess kurtosis of the DM-SNR curve : ${DM}_k = \frac{\frac{1}{n}\sum_{i=1}^n(d_i-{DM}_{\mu})^4}{\left(\frac{1}{n}\sum_{i=1}^n(d_i-{DM}_{\mu})^2\right)^2}-3$
Skewness of the DM-SNR curve : ${DM}_s = \frac{\frac{1}{n}\sum_{i=1}^n(d_i-{DM}_{\mu})^3}{\left(\sqrt{\frac{1}{n-1}\sum_{i=1}^n(d_i-{DM}_{\mu})^2}\right)^3}$
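These eight features can be computed directly from the definitions above. A minimal NumPy sketch (the `profile` array is a made-up stand-in, not real HTRU2 data, and `summary_stats` is a hypothetical helper name):

```python
import numpy as np

def summary_stats(x):
    """Compute the four statistics used for each HTRU2 curve:
    mean, standard deviation (n-1), excess kurtosis, skewness."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu = x.mean()
    sigma = x.std(ddof=1)               # sample std, divides by n-1
    m2 = ((x - mu) ** 2).mean()         # biased central moments
    m3 = ((x - mu) ** 3).mean()
    m4 = ((x - mu) ** 4).mean()
    excess_kurtosis = m4 / m2 ** 2 - 3  # kurtosis minus the normal value 3
    skewness = m3 / sigma ** 3          # third moment over (sample std)^3
    return mu, sigma, excess_kurtosis, skewness

# toy stand-in for a folded pulse profile
profile = np.array([1.0, 2.0, 2.5, 3.0, 10.0])
print(summary_stats(profile))
```

The same four statistics are then computed once for the folded pulse profile and once for the DM-SNR curve, giving the 8 features of each candidate.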
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib
import plotly.express as px
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from imblearn.pipeline import Pipeline as imbPipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier as KNN
from imblearn.over_sampling import SMOTE
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from IPython.display import Image
from IPython.core.display import HTML
import warnings
warnings.filterwarnings("ignore")
#@title #####> Click here to show code
#load the datasets
dataset = pd.read_csv('HTRU_2.csv', sep=',')
dataset.columns = ['mean_profile', 'std_profile', 'kurtosis_profile', 'skewness_profile', 'mean_dmsnr',
'std_dmsnr', 'kurtosis_dmsnr', 'skewness_dmsnr', 'target']
#remove duplicates - if any
dataset = dataset.drop_duplicates(keep='first')
#print info of the dataframe
dataset.info()
The dataset is loaded as a pandas DataFrame, and the column names, referring to the features described in the above section, are added.
Then duplicate rows, if any, are removed, generating a new DataFrame made of 17,897 entries and 9 attributes.
We can see that there are no missing values and that all the attributes are floats. The class label is the attribute target, which can only assume two values: 0 for spurious (non-pulsar) examples and 1 for real pulsars.
Finally, the first few lines of the Dataframe are shown as an example:
#@title #####> Click here to show code
dataset.head()
#@title #####> Click here to show code
dataset.describe()
Data exploration is the initial step in data analysis: it helps to create a broad picture of important trends and major points to study in greater detail. This process makes deeper analysis easier because it can help target future searches and begin the process of excluding irrelevant data points and search paths that may turn up no results. Plots, charts and other data visualization tools are essential in this phase.
First of all we study the correlation matrix in order to see how and which attributes are correlated in our dataset.
A correlation matrix is a table whose rows and columns correspond to the variables, and whose cells contain the correlation coefficient for each possible pair of variables.
Here the coefficient is the Pearson correlation coefficient, a statistic that measures the linear correlation between two variables $X$ and $Y$.
Given a pair of random variables $(X,Y)$, the formula for $\rho$ is:
$\rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X \sigma_Y}$
where $cov$ is the covariance, $\sigma_X$ is the standard deviation of $X$ and $\sigma_Y$ is the standard deviation of $Y$.
Recall that $cov(X,Y) = \mathbb{E}[(X-\mu_X)(Y-\mu_Y)]$, so the formula can be written as:
$\rho_{X,Y} = \frac{\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$
The correlation coefficient ranges from $−1$ to $1$. A value of $1$ implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of $−1$ implies that all data points lie on a line for which Y decreases as X increases. A value of $0$ implies that there is no linear correlation between the variables.
The image below shows some sets of $(x, y)$ points, with the correlation coefficient of $x$ and $y$ for each set.
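The definition can be checked against NumPy's built-in `np.corrcoef` on synthetic data with a known positive linear relation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)    # strong positive linear relation

# Pearson r from the definition: cov(X, Y) / (sigma_X * sigma_Y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_manual = cov_xy / (x.std() * y.std())

r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)              # the two values agree
```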
#@title #####> Click here to show code
corr = dataset.corr(method='pearson')
plt.figure(figsize=(10,10), dpi=80)
sns.heatmap(corr,annot=True,
cmap=sns.color_palette("YlGnBu"),
linewidth=2,edgecolor="k")
plt.title("Correlation between features")
plt.show()
The line of $1$'s going from the top left to the bottom right is the main diagonal, which shows that each variable always perfectly correlates with itself. This matrix is symmetrical, with the same correlation that is shown above the main diagonal being a mirror image of those below the main diagonal.
From the matrix the following statements can be made:
- kurtosis_profile and skewness_profile are strongly positively correlated, with a coefficient of $0.95$, suggesting that one of the two could be removed.
- mean_profile and std_profile, as well as mean_dmsnr and std_dmsnr, are positively correlated.
- the features referring to the skewness and kurtosis (of both the integrated profile and the DM-SNR curve) are negatively correlated with the respective mean and standard deviation. For instance, the correlation coefficient between kurtosis_dmsnr and mean_dmsnr is $-0.62$ and the one between kurtosis_profile and mean_profile is $-0.87$.
#@title #####> Click here to show code
dataset_corr = dataset.corr()['target'][:-1] # exclude the last entry, which is target itself
golden_features_list = dataset_corr[abs(dataset_corr) > 0.5].sort_values(ascending=False)
print("There are {} strongly correlated values with target:\n{}".format(len(golden_features_list), golden_features_list))
The following plot shows both the distribution of each single variable and the relationship between every pair of variables. In particular, violet is used for class 0 (non-pulsar signals) while green is used for class 1 (pulsar signals).
#@title #####> Click here to show code
with sns.axes_style("darkgrid"):
    g = sns.pairplot(dataset[["mean_profile", "std_profile", "kurtosis_profile", "skewness_profile",
                              "mean_dmsnr", "std_dmsnr", "kurtosis_dmsnr", "skewness_dmsnr", "target"]],
                     hue="target",
                     #diag_kind="hist",
                     palette='mako'
                     )
    for ax in g.axes.flat:
        plt.setp(ax.get_xticklabels(), rotation=45)
Looking at the pair plots, we can confirm the fact that kurtosis_profile and skewness_profile are positively correlated (their coefficient was $0.95$). We can do the same reasoning with the attributes kurtosis_dmsnr and skewness_dmsnr: in the pair plot above they show a linear trend (their coefficient was $0.92$).
We can also notice that attributes like kurtosis_profile, std_profile, mean_profile are quite well separated from all the other attributes.
Then we visualize how the target variable is distributed: there are 16,259 negative examples and 1,639 positive examples.
Here we can clearly see that the dataset is highly imbalanced: non-pulsar examples make up more than 90% of the dataset. An oversampling method may be needed in order to build a classifier that is not biased towards the majority class.
#@title #####> Click here to show code
with sns.axes_style("darkgrid"):
    pd.crosstab(dataset['target'], dataset.target).plot(kind="bar", figsize=(10,5))
    plt.title('Classes distribution')
    plt.xlabel('Label')
    plt.ylabel('Frequency')
    plt.show()

    plt.pie(dataset["target"].value_counts().values,
            labels=["not pulsar", "pulsar"],
            autopct="%1.0f%%", wedgeprops={"linewidth": 2, "edgecolor": "white"})
    my_circ = plt.Circle((0,0), .7, color="white")
    plt.gca().add_artist(my_circ)
    plt.subplots_adjust(wspace=.2)
    plt.title("Proportion of target variable in dataset")
    plt.show()
The histogram is a graphical display of the distribution of a quantitative variable. It plots the number of observations that fall in intervals of values.
When examining the distribution of a quantitative variable, one should describe the overall pattern of the data (shape, center, spread) and any deviations from the pattern (outliers). When describing the shape of a distribution, one should consider its symmetry or skewness, the number of peaks (modality), and the presence of gaps or outliers.
Outliers are data points that fall outside the overall pattern of the distribution and need further research before continuing the analysis.
#@title #####> Click here to show code
with sns.axes_style("darkgrid"):
    pd.DataFrame.hist(dataset, figsize=[15,15], bins=20)
    plt.show()
Looking at the histograms, we can notice that the features have very different scales, and several of them are strongly skewed.
Then we can plot the boxplot of each feature with respect to each class. The boxplot is also useful because it gives an indication of where the outliers are located.
A box plot is used to visually show the distribution of numerical data and its skewness by displaying the data quartiles (or percentiles) and averages. In particular, it shows the minimum, the first quartile, the median, the third quartile and the maximum.
The box plot shape will show if a statistical data set is normally distributed or skewed.
When the median is in the middle of the box, and the whiskers are about the same on both sides of the box, then the distribution is symmetric.
When the median is closer to the bottom of the box, and if the whisker is shorter on the lower end of the box, then the distribution is positively skewed (skewed right).
When the median is closer to the top of the box, and if the whisker is shorter on the upper end of the box, then the distribution is negatively skewed (skewed left).
#@title #####> Click here to show code
with sns.axes_style("darkgrid"):
    plt.figure(figsize=(20,18))
    for i in range(8):
        plt.subplot(4, 2, i+1)
        sns.boxplot(data=dataset, y=dataset.columns[i], x="target")
    plt.show()
Taking into account that class 0 represents non-pulsar signals while class 1 is for pulsars, the first thing to notice is that for all the features the medians of class 0 and class 1 are quite different from each other; the two distributions are clearly separated.
Outliers are clearly visible from the boxplots, especially for non-pulsar signals.
Finally, we can notice that some features (std_dmsnr, mean_dmsnr, skewness_profile) show a skewed distribution. Let's take a look at the violin plots and at the histograms for a more accurate understanding:
#@title #####> Click here to show code
# plot distributions
def plot_dists(df, col_1, col_2='target'):
    fig, axis = plt.subplots(1, 2, figsize=(16, 5))
    sns.distplot(df[col_1], ax=axis[0])
    axis[0].set_title('distribution of {}. Skewness = {:.4f}'.format(col_1, df[col_1].skew()))
    sns.violinplot(x=col_2, y=col_1, data=df, ax=axis[1], inner='quartile')
    axis[1].set_title('violin of {}, split by target'.format(col_1))
    plt.show()

for col in dataset.columns[:-1]:
    with sns.axes_style("darkgrid"):
        plot_dists(dataset, col)
It's clear that we have a lot of outliers. They are fairly continuous in the tails, so we can regard them not as erroneous records but as rarer observations that still carry some information.
Since we noticed from the histograms that the variability of the features differs a lot from one another, a data standardization phase is needed. In particular, we decided to use z-score normalization. The features are rescaled such that they have properties of a standard normal distribution with mean zero and standard deviation of one.
Each value is transformed as $z = \frac{x - \mu}{\sigma}$, with mean $\mu = \frac{1}{n}\sum_{i=1}^n x_i$ and standard deviation $\sigma = \sqrt{\frac{\sum_{i=1}^n(x_i-{\mu})^2}{n}}$
Also, standardization is important because many ML algorithms require features to be normalized (e.g. KNN, SVM, logistic regression). When we perform PCA, the features have to be normalized before fitting the model: Principal Component Analysis is interested in the features that maximize the variance so the scale of the attributes must be the same.
In particular, we perform standardization using the StandardScaler from the sklearn library.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
To give an example of its behavior, standardization here is performed on all the dataset: the model will be fitted later on with the train dataset.
#@title ##### Mean and Std Deviation of dataset before Standardization
dataset.describe().loc[['mean','std'],:]
#@title ##### Mean and Std Deviation of dataset after Standardization
scaler = StandardScaler()
#create a copy of the original dataset
dataset_std = dataset.copy()
#fit + transform
scaler.fit(dataset[dataset.columns.values[:-1]])
dataset_std[dataset_std.columns.values[:-1]] = scaler.transform(dataset_std[dataset_std.columns.values[:-1]], copy=True)
dataset_std.describe().loc[['mean','std'],:]
Among the possible ways of handling correlation between features, we opted for Principal Component Analysis. PCA is a dimensionality reduction algorithm that may be used whenever the number of features is very high. It generates new orthogonal axes as linear combinations of the original attributes, and these new features are uncorrelated by definition. Another way of getting rid of the aforementioned problem could be dropping one of the two correlated attributes. Either solution is possible, but PCA was preferred because it significantly reduces the correlation between every pair of attributes, even when it is already low.
Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.
Basically, Principal Components are a new coordinate system.
We take the whole dataset (we will refer to it as $B$) consisting of $D+1$ dimensions and ignore the labels, so that our new dataset becomes $D$-dimensional. The idea of PCA is that the data points will lie mainly in a linear subspace of dimension $d < D$ which maintains most of the variability in the data. So there will be $d$ new variables describing the subspace. These variables are called Principal Components, denoted $u_1, u_2, ..., u_d$, and they are orthogonal linear transformations of the original variables $x_1, x_2, ..., x_D$. For instance, a Principal Component would be expressed as: $u_j = w_1^{(j)}x_1 + ... + w_D^{(j)}x_D$.
In particular the first PC is the one with the maximum variance and subsequent PC will take up successively smaller parts of the total variability.
The covariance matrix $S$ plays a key role in PCA: the variances of the variables and other information about the dataset can be read from it.
$S$ is the sample covariance, which is a $D×D$ matrix and it is calculated as $S = \frac{1}{n-1} BB^T$.
We have that $S$ is symmetric; in fact, if $B$ is any $D×n$ matrix of real numbers, then the $D×D$ matrix $BB^T$ and the $n×n$ matrix $B^TB$ are both symmetric.
Also, we have that $S_{ij} = S_{ji}$ for each $1\leq i,j\leq D$.
Since $S$ is a symmetric matrix (meaning that $S^T=S$), it can be orthogonally diagonalized and has only real eigenvalues $\lambda_1 \geq ... \geq \lambda_D \geq 0$ with eigenvectors $u_1, ..., u_D$
The eigenvector $u_1$ points in the direction of maximal variance if $\lambda_1$ is the maximum eigenvalue of $S$, and it corresponds exactly to the first PC. All the Principal Components are constructed in this way: each is an eigenvector of $S$, and the corresponding eigenvalues satisfy $\lambda_1 \geq ... \geq \lambda_D$.
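The correspondence between principal components and the eigendecomposition of $S$ can be verified numerically: sklearn's `explained_variance_` attribute holds exactly the eigenvalues of the sample covariance matrix. A quick check on made-up correlated data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated toy data

# Eigenvalues of the sample covariance matrix S = B B^T / (n-1)
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (len(X) - 1)
eigvals = np.linalg.eigvalsh(S)[::-1]   # sorted descending: lambda_1 >= ... >= lambda_D

# sklearn's explained variances are exactly these eigenvalues
pca = PCA().fit(X)
print(np.allclose(pca.explained_variance_, eigvals))
```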
To gain further intuition into the relationships between the points, we can use PCA to project them to a 2-dimensional space, and plot the first two principal components of each point to learn about the data:
#@title #####> Click here to show code
#create a copy of the standardized dataset
dataset_pca2 = dataset_std.copy()
projected = dataset_pca2.drop('target', axis=1)
pca = PCA(n_components = 2) # project from 8 to 2 dimensions
pca.fit(projected) #fit only on attributes
projected = pca.transform(projected)
#print(f"Shape of the original dataset is {dataset[:-1].shape}")
print(f"Shape of the dataset after PCA is {projected.shape}")
#plot it!
with sns.axes_style("darkgrid"):
    plt.figure(figsize=(12,10))
    plt.scatter(projected[:, 0], projected[:, 1],
                c=dataset_pca2['target'], edgecolor='none', alpha=0.5,
                cmap='plasma')
    plt.xlabel('component 1')
    plt.ylabel('component 2')
    plt.colorbar();
Blue corresponds to samples belonging to class 0 (non-pulsar signals) while yellow corresponds to class 1 (pulsars). The full data consists of 8-dimensional points, and the plot shows the projection of each data point along the two directions with the largest variance. Essentially, we have found the optimal stretch and rotation in 8-dimensional space that allows us to see the layout of the signals in two dimensions, and we have done this in an unsupervised manner, that is, without reference to the labels.
From the plot we can notice that even with 2 dimensions, the data points are quite well separated in the space. In addition to this, we noticed from the boxplots that the distribution of both pulsar and non-pulsar records differ a lot from each other: we expect to perform classification reaching good performances.
The plot below shows the data in 3 dimensions:
#@title #####> Click here to show code
import plotly.express as px
#create a copy of the standardized dataset
dataset_pca = dataset_std.copy()
X = dataset_pca.drop('target', axis=1)
pca = PCA(n_components=3)
components = pca.fit_transform(X)
total_var = pca.explained_variance_ratio_.sum() * 100
fig = px.scatter_3d(
components, x=0, y=1, z=2, color = dataset['target'],
size= [1 for i in range(len(dataset))], opacity=1,
title=f'Total Explained Variance: {total_var:.2f}%',
labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}
)
# tight layout
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show(renderer="notebook")
We already said that $S$ is the covariance matrix. The trace of $S$ (we will refer to it as $T$) is the sum of the diagonal entries of $S$, which corresponds to the sum of the variances of all the variables. $T$ is also the sum of all the eigenvalues: $T= \lambda_1 + \lambda_2 + ... + \lambda_D$.
The direction in $\mathbb{R}^D$ given by $u_1$ accounts for an amount $\frac{\lambda_1}{T}$ of the total variance, and so on.
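This relation between the trace, the eigenvalues and the explained variance ratios can be checked numerically on made-up data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

S = np.cov(X, rowvar=False)         # sample covariance matrix
T = np.trace(S)                     # total variance
eigvals = np.linalg.eigvalsh(S)

pca = PCA().fit(X)
# the trace equals the sum of the eigenvalues, and lambda_i / T is the
# fraction of variance explained by the i-th principal component
print(np.isclose(T, eigvals.sum()))
print(np.allclose(pca.explained_variance_ratio_, np.sort(eigvals)[::-1] / T))
```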
#@title #####> Click here to show code
# Divide the dataset
X = dataset.drop(['target'], axis=1)
y = dataset[['target']]
# Construct the training and testing splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
# Create an object of the standard scaler from sklearn
scaler = StandardScaler()
# Fit the standard scaler to the training data
scaler.fit(X_train)
# Use the transform function to normalize the data sets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
#create a copy of the standardized dataset
dataset_pca = X_train.copy()
X = dataset_pca.copy()
pca = PCA()
pca.fit(X)
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)
fig = px.area(
x=range(1, exp_var_cumul.shape[0] + 1),
y=exp_var_cumul,
labels={"x": "# Components", "y": "Explained Variance"}
)
fig.show(renderer="notebook")
From the plot we can see that with only 2 components the explained variance already reaches very high values. In this work I decided to keep a number of principal components able to explain at least $95\%$ of the total variance, so the number of new features is 5.
Here we use the model:
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
of the scikit-learn library, which expects the standardized dataset as input. PCA generates 5 new features which are uncorrelated, as is clearly evident if we take a look at the correlation matrix:
#@title #####> Click here to show code
#create a copy of the standardized dataset
dataset_pca = dataset_std.copy()
X = dataset_pca.drop('target', axis=1)
pca = PCA(n_components=5)
X = pca.fit_transform(X)
df_pca = pd.DataFrame(data=X, columns=["PC1","PC2","PC3","PC4", "PC5"])
#plot it!
corr = df_pca.corr(method='pearson')
plt.figure(figsize=(7,6), dpi=80)
sns.heatmap(corr,annot=True,
cmap=sns.color_palette("YlGnBu"),
linewidth=2,edgecolor="k")
plt.title("Correlation between features after PCA")
plt.show()
So far we have seen from the plot of the target distribution that the dataset is highly unbalanced: 91% of the records belong to class 0, while 9% are from class 1. In these situations an oversampling method may be useful in order to build a classifier that is not biased: due to the disparity of classes in the variables, the algorithm would tend to categorize into the class with more instances, giving the false sense of a highly accurate model.
To avoid this problem I decided to apply a technique called $SMOTE$ (Synthetic Minority Over-sampling Technique), which is known to be very effective on datasets with this high degree of imbalance, especially when points belonging to different classes lie in mostly non-overlapping regions of the space.
$SMOTE$ first selects a minority class instance $a$ at random and finds its $k$ nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors $b$ at random and connecting $a$ and $b$ to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances $a$ and $b$.
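The convex-combination idea can be sketched in a few lines of NumPy (illustrative only, not imblearn's actual implementation; `smote_sample` is a made-up helper name):

```python
import numpy as np

def smote_sample(minority, k=5, rng=None):
    """Generate one synthetic minority sample as a convex combination
    of a random minority point a and one of its k nearest minority neighbors b."""
    if rng is None:
        rng = np.random.default_rng()
    a = minority[rng.integers(len(minority))]
    # k nearest minority neighbors of a (index 0 is a itself, so skip it)
    d = np.linalg.norm(minority - a, axis=1)
    neighbors = minority[np.argsort(d)[1:k + 1]]
    b = neighbors[rng.integers(len(neighbors))]
    lam = rng.random()                  # lam in [0, 1)
    return a + lam * (b - a)            # point on the segment between a and b

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                     [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
print(smote_sample(minority, k=3, rng=np.random.default_rng(0)))
```

Since each synthetic point lies on a segment between two existing minority points, the new samples stay inside the region already occupied by the minority class.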
In particular, $SMOTE$ was performed using the class from imblearn library:
from imblearn.over_sampling import SMOTE
oversample = SMOTE(random_state=42)
I decided not to set any particular value of $k$, so the default value $k=5$ will be used.
An important point to underline is that I decided to apply the oversampling technique separately to each training fold generated during the cross-validation phase: this is the only way to train the model on a balanced training set while still validating it on an unbalanced set, avoiding any leakage of synthetic samples into the validation folds.
The HTRU dataset was split into training and test sets with a proportion of 75:25.
After applying standardization to the training dataset, K-Fold Cross-Validation was performed with the following models: K-Nearest Neighbors, Support Vector Machines, Logistic Regression, Decision Tree and Random Forest.
The models were fitted on different versions of the dataset: the plain standardized data, the PCA-reduced data, the SMOTE-oversampled data, and the data with both PCA and SMOTE applied.
Since we are performing a classification task, we have to define the metrics used to evaluate the classification. In a supervised learning scenario like the one we are examining, we first fit/train a model on the training data X_train, then test the model on X_test. We then compare the model's predictions to the true labels, obtaining a count of correct and incorrect matches; but not all matches hold equal value in reality, therefore we will need more than a single metric to evaluate our predictions.
The confusion matrix provides an insightful picture: it shows which classes are being predicted correctly and incorrectly, and what type of errors are being made. Here, the abbreviations stand for: TP (true positives, pulsars correctly classified as pulsars), TN (true negatives), FP (false positives, non-pulsars classified as pulsars) and FN (false negatives).
The equations of the 4 key evaluation metrics are:
$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$, $Precision = \frac{TP}{TP+FP}$, $Recall = \frac{TP}{TP+FN}$, $F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$
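The four metrics can be computed directly from the confusion-matrix counts; a quick check with made-up counts:

```python
# metrics computed from made-up confusion-matrix counts
tp, tn, fp, fn = 80, 900, 20, 10

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)                           # how many predicted pulsars are real
recall    = tp / (tp + fn)                           # how many real pulsars are found
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```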
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. This method refits a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model.
The idea is to randomly divide the data into $K$ equal-sized parts. We leave out part $k$, fit the model to the other $K-1$ parts combined, and then obtain predictions for the left-out $k$-th part. This is done in turn for each part $k = 1,2,...,K$, and the results are then combined. The procedure is often called K-fold cross-validation.
It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.
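The splitting scheme can be illustrated with scikit-learn's KFold on a tiny made-up array: each part is left out exactly once as the validation fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)    # 10 toy samples

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # fit on the K-1 combined parts, predict on the held-out part
    print(f"fold {fold}: train on {len(train_idx)} samples, validate on {len(val_idx)}")
```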
Besides, when the algorithm treated has more than one hyperparameter, the configurations of hyperparameters tested are generated by means of a Grid Search.
#@title #####> Click here to show code
# Divide the dataset
X = dataset.drop(['target'], axis=1)
y = dataset[['target']]
# Construct the training and testing splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
# Create an object of the standard scaler from sklearn
scaler = StandardScaler()
# Fit the standard scaler to the training data
scaler.fit(X_train)
# Use the transform function to normalize the data sets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
def searchHyper(pipeline, grid, X_train, y_train):
    gsCV = GridSearchCV(pipeline, param_grid=grid, scoring='f1_weighted',
                        verbose=0, cv=5, n_jobs=-1)
    gsCV.fit(X_train, y_train)
    print()
    print(f"Best parameters set found on development set: {gsCV.best_params_}")
    return gsCV
def applyModel(gsCV):
    print("Detailed classification report:")
    y_pred = gsCV.best_estimator_.predict(X_test)
    print(classification_report(y_test, y_pred))
    print()
    #F1 score
    f1 = f1_score(y_test, y_pred, average='weighted')
    print(f"F1-score is: {f1}")
    #plot confusion matrix
    plt.figure(figsize=(10,7))
    sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='.1f', cmap=sns.color_palette("Blues", 25))
    plt.show()
The K-Nearest Neighbors algorithm is a supervised learning algorithm that is often used for classification. It is based on the idea that samples belonging to the same class are close to each other.
In this work we'll use the Euclidean distance as metric. It is defined as:
given two points $X, Y \in \mathbb{R}^n$, the distance is ${\sqrt{\sum _{k=1}^{n}{(x_{k}-y_{k})^{2}}}}$.
$KNN$ is a lazy learning algorithm: it has no specialized training phase and uses all the training data at classification time. It is also a non-parametric learning algorithm, since it makes no assumptions about the underlying data distribution.
During classification, the algorithm follows these steps:
Choose the value of $K$ (the number of nearest data points): it can be any integer.
For each point $C$ in the test data: compute the distance from $C$ to every training point, select the $K$ training points closest to $C$, and assign to $C$ the most frequent class among those $K$ neighbors.
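The classification procedure can be sketched in a few lines of NumPy (an illustrative toy implementation with a hypothetical `knn_predict` helper, not the sklearn classifier used below):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, c, k=3):
    """Classify point c by majority vote among its k nearest training points."""
    # 1. Euclidean distance from c to every training point
    dists = np.linalg.norm(X_train - c, axis=1)
    # 2. indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # 3. majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))   # → 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))   # → 1
```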
In KNN, decision boundaries change along with the parameter $K$; in fact we can think of $K$ as controlling the shape of the decision boundary.
As stated before, $K$ in KNN refers to the number of nearest neighbors included in the majority vote. Choosing smaller values of $K$ makes the prediction noisier, since individual points have a higher influence on the result. Larger values of $K$ yield smoother decision boundaries, which means lower variance but increased bias, and are also more computationally expensive.
For KNN models, complexity is determined by the value of $K$ (lower value = more complex).
#@title #####> Click here to show code
#make pipeline for PCA + SMOTE + KNN
pipeline_pca_smt_knn = imbPipeline([('pca', PCA(n_components=5)),
('smote', SMOTE(random_state=42)),
('classifier', KNN())
])
#make pipeline for SMOTE + KNN
pipeline_smt_knn = imbPipeline([
('smote', SMOTE(random_state=42)),
('classifier', KNN())
])
#make pipeline for PCA + KNN
pipeline_pca_knn = imbPipeline([('pca', PCA(n_components=5)),
('classifier', KNN())
])
#make pipeline for KNN
pipeline_knn = imbPipeline([
('classifier', KNN())
])
# Set the parameters
grid_pca_knn = {'classifier__n_neighbors': [3, 4, 5, 7 ]}
grid_knn = {'classifier__n_neighbors': [3, 4, 5, 7 ]}
gsCV = searchHyper(pipeline_knn, grid_knn, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_knn, grid_pca_knn, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_smt_knn, grid_knn, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_smt_knn, grid_pca_knn, X_train, y_train)
applyModel(gsCV)
In this section we will explore another important model in ML: Support Vector Machines. The aforementioned algorithm will be approached and applied on a 2-class classification problem.
The main idea of SVM models is that we want to find an optimal hyperplane that separates the classes in the feature space.
A hyperplane in $p$ dimensions is a flat affine subspace of dimension $p-1$ and it has the form $b + w_1 x_1+...+ w_p x_p = 0$ (that is, $\langle w,x\rangle+b=0$). The vector $w = (w_1, ..., w_p)$ is called the normal vector and it points in a direction orthogonal to the surface of the hyperplane.
Among all separating hyperplanes, SVM finds the one with the biggest margin between the two classes. The goal is therefore to find the hyperplane having the largest margin, where the margin $\delta$ is defined as the distance of the closest example from the decision hyperplane; the closest data points are known as support vectors.
When we want to solve a classification problem, we have a set of points $(x_i, y_i)$ that we want to classify. Here $y_i$ is the label and, since we are dealing with a binary classification problem, it can be $+1$ or $-1$.
These regions can be also identified with hyperplanes which are parallel to the decision boundary.
So we can define the margin explicitly in terms of these hyperplanes and, since no data are inside them, we will define the margin as the distance between them: we can think at the margin as twice the distance from the decision boundary to the nearest training point (margin = $\frac{2}{\|w\|}$). The points on the margin are called support vectors.
We say that the points are linearly separable if $y_i(\langle w,x_i\rangle+b)>0$ for each $i=1,...,n$, where $n$ is the number of training points.
Hard-SVM is the learning rule in which we select a hyperplane that separates the training set with the largest possible margin. Therefore, the Hard-SVM rule is:
$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(\langle w,x_i\rangle+b)\geq 1 \;\; \forall i$
This is the Primal Optimization Problem, which can be rewritten in its dual form:
$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \langle x_i,x_j\rangle \quad \text{s.t.} \quad \alpha_i \geq 0, \; \sum_i \alpha_i y_i = 0$
Which allows us to write $w$ as a linear combination of the support vectors.
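As a minimal sketch of this fact (on a hypothetical toy dataset, not part of the pipelines below), we can verify numerically that the primal weight vector $w$ equals the linear combination of support vectors given by the dual solution; scikit-learn stores the products $\alpha_i y_i$ in `dual_coef_`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy linearly separable 2-class dataset (illustrative only)
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)

# A very large C approximates the Hard-SVM rule
clf = SVC(kernel='linear', C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors, so
# w = sum_i (alpha_i * y_i) * x_i, summed over the support vectors only
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True

# Margin width = 2 / ||w||
print(2.0 / np.linalg.norm(clf.coef_))
```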
When linear separation is impossible (the data are not linearly separable), the solution is to make the margin soft by adding a relaxation (slack) term:
Here, the parameter $C$ is the regularization factor; together with the slack variable $\xi$, which measures the distance of a misclassified sample from the margin, it allows some misclassification by making the margin soft.
In particular, $C$ is the inverse of regularization strength:
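As a rough illustration of the slack variables (a sketch on synthetic data, separate from the pipelines below), we can compute $\xi_i = \max(0,\; 1 - y_i(\langle w, x_i\rangle + b))$ for a fitted linear SVM:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic overlapping classes, so a soft margin is required
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=42)
y_pm = np.where(y == 1, 1, -1)          # relabel to {-1, +1}

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# Slack: xi_i = max(0, 1 - y_i * (<w, x_i> + b)).
# xi_i = 0 for points on the correct side of the margin,
# 0 < xi_i <= 1 inside the margin, xi_i > 1 for misclassified points.
scores = X @ clf.coef_.ravel() + clf.intercept_[0]
xi = np.maximum(0.0, 1.0 - y_pm * scores)
print((xi > 1).sum(), "misclassified points")
```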
#@title #####> Click here to show code
#make pipeline for PCA + SMOTE + SVC
pipeline_pca_smt_svc = imbPipeline([('pca', PCA(n_components=5)),
('smote', SMOTE(random_state=42)),
('classifier', svm.LinearSVC())
])
#make pipeline for SMOTE + SVC
pipeline_smt_svc = imbPipeline([
('smote', SMOTE(random_state=42)),
('classifier', svm.LinearSVC())
])
#make pipeline for PCA + SVC
pipeline_pca_svc = imbPipeline([('pca', PCA(n_components=5)),
('classifier', svm.LinearSVC())
])
#make pipeline for SVC
pipeline_svc = imbPipeline([
('classifier', svm.LinearSVC())
])
# Set the parameters
grid_pca_svc = {'classifier__C': [ 0.1, 1, 10, 100]}
grid_svc = {'classifier__C': [ 0.1, 1, 10, 100]}
gsCV = searchHyper(pipeline_svc, grid_svc, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_svc, grid_pca_svc, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_smt_svc, grid_svc, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_smt_svc, grid_pca_svc, X_train, y_train)
applyModel(gsCV)
The SVM algorithm is also known for its use of the kernel trick. A kernel is a way of computing the dot product of two vectors in some (possibly very high dimensional) feature space, which is why kernel functions are sometimes called generalized dot products.
Applying the kernel trick simply means replacing the dot product of two vectors with the kernel function.
We will refer to a kernel function for a mapping $\psi$ as a function that implements the inner product in the feature space and it is defined as $K(x,x')=⟨\psi(x),\psi(x')⟩$.
In particular, $K(x,x')$ is a valid kernel function if and only if it satisfies Mercer's condition:
The Gram matrix $G_{i,j} = K(x_i,x_j)$ must be positive semidefinite, i.e. $x^T G x \geq 0$ for every vector $x$, which is equivalent to all of its eigenvalues being non-negative.
The Gaussian RBF (Radial Basis Function) is a popular kernel used in SVM models. The transformation is the following:
The SVC model with RBF kernel has non-linear boundaries, unlike the LinearSVC model, which can only generate linear ones. This behaviour is made possible by the use of kernels. The non-linear boundaries generated by `SVC(kernel='rbf')` vary according to two parameters, $C$ and $\gamma$, which are tuned during 5-fold cross-validation. SVM with the RBF kernel introduces a new parameter, $\gamma$: it can be thought of as the 'spread' of the kernel and therefore of the decision region. The parameter $C$, on the other hand, represents the penalty for misclassification: low values of $C$ correspond to a model that is tolerant of misclassified data points, while with high values of $C$ the classifier becomes intolerant of them.
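As a small sanity check (a sketch on arbitrary synthetic data), we can build the Gaussian RBF Gram matrix $K(x,x')=\exp(-\gamma\|x-x'\|^2)$ by hand, compare it with scikit-learn's implementation, and confirm the Mercer condition that its eigenvalues are non-negative:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

# Gaussian RBF kernel: K(x, x') = exp(-gamma * ||x - x'||^2)
gamma = 0.1
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G = np.exp(-gamma * sq_dists)

# Matches scikit-learn's implementation
print(np.allclose(G, rbf_kernel(X, gamma=gamma)))  # True

# Mercer condition: the Gram matrix is positive semidefinite,
# i.e. all eigenvalues are non-negative (up to numerical tolerance)
eigvals = np.linalg.eigvalsh(G)
print(eigvals.min() >= -1e-10)  # True
```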
#@title #####> Click here to show code
#make pipeline for PCA + SMOTE + SVC
pipeline_pca_smt_rbf = imbPipeline([('pca', PCA(n_components=5)),
('smote', SMOTE(random_state=42)),
('classifier', svm.SVC(kernel='rbf'))
])
#make pipeline for SMOTE + SVC
pipeline_smt_rbf = imbPipeline([
('smote', SMOTE(random_state=42)),
('classifier', svm.SVC(kernel='rbf'))
])
#make pipeline for PCA + SVC
pipeline_pca_rbf = imbPipeline([('pca', PCA(n_components=5)),
('classifier', svm.SVC(kernel='rbf'))
])
#make pipeline for SVC
pipeline_rbf = imbPipeline([
('classifier', svm.SVC(kernel='rbf'))
])
# Set the parameters
grid_pca_rbf = {'classifier__C': [ 0.1, 1, 10, 100],
'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_rbf = {'classifier__C': [ 0.1, 1, 10, 100],
'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
gsCV = searchHyper(pipeline_rbf, grid_rbf, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_rbf, grid_pca_rbf, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_smt_rbf, grid_rbf, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_smt_rbf, grid_pca_rbf, X_train, y_train)
applyModel(gsCV)
Decision Tree is a tree-based method used for both classification and regression tasks. These methods involve stratifying or segmenting the predictor space into a number of simple regions.
Decision Trees used for classification are supervised learning methods and they are used to predict a qualitative response (which is, in our case, 0 and 1). Decision trees learn from data to approximate a set of if-then-else decision rules. The final result is a tree with decision nodes and leaf nodes. The deeper the tree, the more complex the decision rules and the fitter the model.
In a decision tree, to predict the class of a given sample, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the sample and, based on the comparison, follows a branch and jumps to the next node.
At the next node, the algorithm again compares the attribute value and moves further down, continuing the process until it reaches a leaf node. Two properties characterize how the tree itself is built:
Decision Tree is a top-down algorithm because it begins at the top of the tree and then successively splits the predictor space; it is greedy because at each step of the tree-building process, the best split is made at that particular step.
An optimal tree size is chosen adaptively from the training data. The recommended approach is to build a fully-grown decision tree and then extract a nested sub-tree (pruning), so that the final tree is a compromise between purity and complexity.
In order to calculate node impurity and to select the right attribute on which the split is performed, we can use these metrics:
Gini Index : a measure of node impurity used while growing a decision tree. A split with a low Gini index should be preferred over one with a high Gini index. It is calculated as:
$G = \sum_{k=1}^K \hat{p}_{mk}(1-\hat{p}_{mk})$ where $K$ is the number of classes and $\hat{p}_{mk}$ is the proportion of training observations in the $m^{th}$ region belonging to the $k^{th}$ class.
Cross-Entropy : another measure of node impurity. As with the Gini index, the $m^{th}$ node is purer when the entropy $D$ is smaller. It is calculated as:
$D = -\sum_{k=1}^K \hat{p}_{mk}\log(\hat{p}_{mk})$
Misclassification Error : it is the fraction of the training observations in the $m^{th}$ node that do not belong to the most common class. It is calculated as:
$E = 1-\max_k \hat{p}_{mk}$
When growing a decision tree, the Gini index or entropy is typically used to evaluate the quality of a split, whereas the misclassification error is typically used for pruning the tree.
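The three impurity measures can be sketched directly from their definitions (a minimal illustration with hypothetical node labels):

```python
import numpy as np

def node_impurities(labels):
    """Gini index, entropy (base 2) and misclassification error of a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()              # class proportions p_mk
    gini = np.sum(p * (1.0 - p))           # G = sum_k p_mk (1 - p_mk)
    entropy = -np.sum(p * np.log2(p))      # D = -sum_k p_mk log(p_mk)
    error = 1.0 - p.max()                  # E = 1 - max_k p_mk
    return gini, entropy, error

# Maximally mixed binary node: Gini = 0.5, entropy = 1 bit, error = 0.5
g, d, e = node_impurities([0, 0, 1, 1])
print(g, d, e)

# A pure node has zero impurity under all three measures
g, d, e = node_impurities([1, 1, 1, 1])
print(g, d, e)
```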
#@title #####> Click here to show code
#make pipeline for PCA + SMOTE + DT
pipeline_pca_smt_dt = imbPipeline([('pca', PCA(n_components=5)),
('smote', SMOTE(random_state=42)),
('classifier', DecisionTreeClassifier())
])
#make pipeline for SMOTE + DT
pipeline_smt_dt = imbPipeline([
('smote', SMOTE(random_state=42)),
('classifier', DecisionTreeClassifier())
])
#make pipeline for PCA + DT
pipeline_pca_dt = imbPipeline([('pca', PCA(n_components=5)),
('classifier', DecisionTreeClassifier())
])
#make pipeline for DT
pipeline_dt = imbPipeline([
('classifier', DecisionTreeClassifier())
])
grid_dt = {'classifier__criterion': ["gini", "entropy"],
'classifier__max_depth': [10,13, 14, 15,20]}
gsCV = searchHyper(pipeline_dt, grid_dt, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_dt, grid_dt, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_smt_dt, grid_dt, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_smt_dt, grid_dt, X_train, y_train)
applyModel(gsCV)
Two major limitations of decision trees are that they are prone to overfitting, and that they tend to be non-robust, meaning a small change in the training data results in a very different tree. Random Forest models overcome these two shortcomings of decision trees by generating many decision trees, and then aggregating the predictions of each individual tree to a single model prediction.
Random forest is an ensemble machine learning method that creates multiple decision trees and then combines the trees into a single model by aggregating the individual tree predictions.
To decorrelate the trees that make up a random forest, a process called bootstrap aggregating (also known as bagging) is used. Bagging generates new training datasets by sampling the original training data with replacement, once for each decision tree in the forest. Each bootstrapped dataset is then used to construct one tree.
A similar process called the random subspace method (also called attribute bagging or feature bagging) is also implemented to create a random forest model.
When building the decision trees, each time a split in a tree is considered, a random selection of $m$ predictors is chosen as split candidates from the full set of $p$ predictors (typically the number of new features is $m=\sqrt{p}$). The split is allowed to use only one of those $m$ predictors.
This further decorrelates the trees by preventing dominant predictor variables from being the first or only variables selected to create splits in each of the individual decision trees.
The combination of bagging and the random subspace method results in a random forest model.
When a data point $x$ has to be classified, every tree makes a prediction and the final label is assigned by majority voting.
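The bagging + random-subspace + majority-vote procedure can be sketched by hand with individual decision trees (synthetic data; in practice `RandomForestClassifier`, used below, does all of this internally):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
preds = []
for t in range(25):
    # Bagging: bootstrap sample of the training set (drawn with replacement)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Random subspace: each split considers only m = sqrt(p) random features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=t)
    tree.fit(X_tr[idx], y_tr[idx])
    preds.append(tree.predict(X_te))

# Majority vote across the 25 trees for each test point
votes = np.stack(preds)                           # shape: (25, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble accuracy:", (y_pred == y_te).mean())
```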
#@title #####> Click here to show code
#make pipeline for PCA + SMOTE + RF
pipeline_pca_smt_rf = imbPipeline([('pca', PCA(n_components=5)),
('smote', SMOTE(random_state=42)),
('classifier', RandomForestClassifier())
])
#make pipeline for SMOTE + RF
pipeline_smt_rf = imbPipeline([
('smote', SMOTE(random_state=42)),
('classifier', RandomForestClassifier())
])
#make pipeline for PCA + RF
pipeline_pca_rf = imbPipeline([('pca', PCA(n_components=5)),
('classifier', RandomForestClassifier())
])
#make pipeline for RF
pipeline_rf = imbPipeline([
('classifier', RandomForestClassifier())
])
grid_rf = {'classifier__criterion': ["gini", "entropy"],
'classifier__max_depth': [5, 10, 15],
'classifier__n_estimators': [10, 100, 500, 1000]}
gsCV = searchHyper(pipeline_rf, grid_rf, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_rf, grid_rf, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_smt_rf, grid_rf, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_smt_rf, grid_rf, X_train, y_train)
applyModel(gsCV)
Before talking about Logistic Regression, it is useful to recall the linear model. Linear regression is the simplest statistical technique for predictive modelling: it explains the relationship between a dependent variable (target) and one or more explanatory variables (predictors) using a straight line.
An example of a multiple linear regression model is:
$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n$
To fit this model (that is, to find the coefficients $\beta_0, \beta_1, ...$), the method of least squares is used: it minimizes the sum of squared differences between the observed values $y_i$ and the predicted values $\hat{y}_i$, where each error is $e_i = y_i - \hat{y}_i$.
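Least squares can be illustrated in a few lines (a sketch with hypothetical coefficients, unrelated to the pulsar data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# Hypothetical true model: Y = 3 + 2*X1 - 1*X2 + noise
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Minimize the sum of squared errors sum_i (y_i - y_hat_i)^2
A = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)                                   # close to [3.0, 2.0, -1.0]

residuals = y - A @ beta                      # e_i = y_i - y_hat_i
print("SSE:", (residuals ** 2).sum())
```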
Logistic Regression is a particular Generalized Linear Model. Generalized Linear Models extend Linear Regression to cases where the relationship between the predictors and the dependent variable is not linear.
When we have a classification problem, we want to predict the binary output variable Y, which can assume 2 values: either 1 or 0. This is represented by a Bernoulli variable.
Here, the assumption of normality is violated, since a binary response is not normally distributed.
Moreover, both mean and variance depend on the underlying probability. Any factor that affects the probability will change not just the mean but also the variance of the observations, which means the variance is no longer constant, violating the assumption of homoscedasticity. As a result, we cannot directly apply linear regression because it won't be a good fit.
Since we are now modelling probabilities, the model should predict values on the scale from 0 to 1. The Generalized Linear Model (GLM) is suited to these situations: it allows response variables with distributions other than the normal, and uses a link function so that a transformation of the expected response, rather than the response itself, varies linearly with the predictors.
In Logistic Regression, we don't directly fit a straight line to our data as in linear regression. Instead, we fit an S-shaped curve, called the sigmoid, to our observations.
By computing the sigmoid function of $\beta_0 + \beta_1X$ we get a probability of an observation belonging to one of the two categories. The formula of the sigmoid function is the following:
To transform the model from linear regression to logistic regression using the logistic function, we have:
where $P$ corresponds to the probability of success ($1$) and $1-P$ to the probability of failure ($0$). If $P$ is greater than 0.5, class 1 is assigned; otherwise class 0 is predicted.
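As a quick check (a sketch on synthetic data), applying the sigmoid to the linear score $\beta_0 + \langle \beta, x \rangle$ reproduces scikit-learn's predicted probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=42)
clf = LogisticRegression().fit(X, y)

# Sigmoid: P(y = 1 | x) = 1 / (1 + exp(-(b0 + <w, x>)))
scores = X @ clf.coef_.ravel() + clf.intercept_[0]
p = 1.0 / (1.0 + np.exp(-scores))

# Matches scikit-learn's probability for class 1
print(np.allclose(p, clf.predict_proba(X)[:, 1]))  # True

# Thresholding P at 0.5 reproduces the hard class labels
print(np.array_equal((p > 0.5).astype(int), clf.predict(X)))
```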
#@title #####> Click here to show code
#make pipeline for PCA + SMOTE + LOGISTIC REGRESSION
pipeline_pca_smt_lr = imbPipeline([('pca', PCA(n_components=5)),
('smote', SMOTE(random_state=42)),
('classifier', LogisticRegression())
])
#make pipeline for SMOTE + LOGISTIC REGRESSION
pipeline_smt_lr = imbPipeline([
('smote', SMOTE(random_state=42)),
('classifier', LogisticRegression())
])
#make pipeline for PCA + LOGISTIC REGRESSION
pipeline_pca_lr = imbPipeline([('pca', PCA(n_components=5)),
('classifier', LogisticRegression())
])
#make pipeline for LOGISTIC REGRESSION
pipeline_lr = imbPipeline([
('classifier', LogisticRegression())
])
# Set the parameters
grid_pca_lr = {'classifier__C': [0.01, 0.1, 1, 10, 100]}
grid_lr = {'classifier__C': [0.01, 0.1, 1, 10, 100]}
gsCV = searchHyper(pipeline_lr, grid_lr, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_lr, grid_pca_lr, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_smt_lr, grid_lr, X_train, y_train)
applyModel(gsCV)
gsCV = searchHyper(pipeline_pca_smt_lr, grid_pca_lr, X_train, y_train)
applyModel(gsCV)
Here is the plot of the F1-scores obtained on the test set with all the dataset configurations and all the models. I decided to plot only the F1-score because of the imbalance in the dataset: the F1-score seeks an equal balance between precision and recall, which is extremely useful when working with imbalanced datasets.
#@title #####> Click here to show code
# Per-model F1 scores, ordered [std, std+pca, std+smote, std+pca+smote]:
# knn         = [0.97817, 0.9765, 0.9553, 0.9534]
# svcL        = [0.9719, 0.9710, 0.9685, 0.9658]
# svcK        = [0.9783, 0.9758, 0.9721, 0.9708]
# dec_tree    = [0.9712, 0.9694, 0.9529, 0.9480]
# rand_forest = [0.9778, 0.9765, 0.9710, 0.9699]
# log_reg     = [0.9764, 0.9756, 0.9671, 0.9636]
std = [0.97817, 0.9719, 0.9783, 0.9712, 0.9778, 0.9764]
std_pca = [0.9765, 0.9710, 0.9758, 0.9694, 0.9765, 0.9756]
std_smt = [0.9553, 0.9685, 0.9721, 0.9529, 0.9710, 0.9671]
std_pca_smt = [0.9534, 0.9658, 0.9708, 0.9480, 0.9699, 0.9636]
x = ["knn", "LinearSVC", "SVC_RBF", "DecisionTree", "RandForest", "LogisticReg"]
import plotly.graph_objects as go
fig = go.Figure()
fig.add_trace(go.Histogram(histfunc="sum",
y=std,
x=x,
name="std"))
fig.add_trace(go.Histogram(histfunc="sum",
y=std_pca,
x=x,
name="std+pca"))
fig.add_trace(go.Histogram(histfunc="sum",
y=std_smt,
x=x,
name="std+smote"))
fig.add_trace(go.Histogram(histfunc="sum",
y=std_pca_smt,
x=x,
name="std+pca+smote"))
fig.update_yaxes(range=[0, 1])
fig.show(renderer="notebook")
We can notice that:
In general, with all the models, the best performances are achieved when fitting on the standardized dataset.
The best models are KNN (0.9781), SVC with RBF kernel (0.9783) and Random Forest (0.9778).
In general, we can see that this dataset is already well suited for classification on its own: applying PCA did not bring any significant improvement, since the number of dimensions is already manageable, and our models are barely affected by the class imbalance, so applying SMOTE led to a small drop in performance.